The purpose of this analysis is to identify the key variables behind the low returns recorded in the production of comedy-action-thriller films. It will help us identify the variables responsible for high and low return on investment in movie production and, ultimately, make informed decisions when selecting movie genres.
This report is based on an analysis of the IMDB dataset, which contains important variables for past movies produced by SussexBudgetProductions. We will conduct an exploratory data analysis (EDA) on the dataset to gain insight and to propose a hypothesis that explains the observations recorded in our analysis.
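The report does not define return on investment. A minimal sketch follows, assuming ROI is taken as (gross - budget) / budget using the dataset's `gross` and `budget` columns. The `roi` helper and the toy figures are hypothetical and only illustrate the calculation.

```python
import numpy as np
import pandas as pd


def roi(df: pd.DataFrame) -> pd.Series:
    """Return (gross - budget) / budget per row; NaN where it cannot be computed."""
    # A zero budget would divide by zero, so treat it as missing.
    budget = df["budget"].replace(0, np.nan)
    return (df["gross"] - budget) / budget


# Toy figures for illustration only (not taken from the dataset).
toy = pd.DataFrame(
    {
        "gross": [300.0, 50.0, np.nan, 10.0],
        "budget": [100.0, 100.0, 80.0, 0.0],
    }
)
toy_roi = roi(toy)
```

Rows with a missing `gross` or `budget` come out as NaN rather than being silently dropped, so they stay visible in later steps.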
# Importing the necessary data-manipulation libraries
import numpy as np
import pandas as pd
#Importing some necessary visualisation libraries
import matplotlib.pyplot as plt
from pandas_profiling import ProfileReport
import seaborn as sns
%matplotlib inline
# Importing the dataset and reading it into a pandas DataFrame
data = pd.read_csv('movie_metadata.csv')
data.head()  # display the first 5 rows
|  | color | director_name | num_critic_for_reviews | duration | director_facebook_likes | actor_3_facebook_likes | actor_2_name | actor_1_facebook_likes | gross | genres | ... | num_user_for_reviews | language | country | content_rating | budget | title_year | actor_2_facebook_likes | imdb_score | aspect_ratio | movie_facebook_likes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Color | James Cameron | 723.0 | 178.0 | 0.0 | 855.0 | Joel David Moore | 1000.0 | 760505847.0 | Action\|Adventure\|Fantasy\|Sci-Fi | ... | 3054.0 | English | USA | PG-13 | 237000000.0 | 2009.0 | 936.0 | 7.9 | 1.78 | 33000 |
| 1 | Color | Gore Verbinski | 302.0 | 169.0 | 563.0 | 1000.0 | Orlando Bloom | 40000.0 | 309404152.0 | Action\|Adventure\|Fantasy | ... | 1238.0 | English | USA | PG-13 | 300000000.0 | 2007.0 | 5000.0 | 7.1 | 2.35 | 0 |
| 2 | Color | Sam Mendes | 602.0 | 148.0 | 0.0 | 161.0 | Rory Kinnear | 11000.0 | 200074175.0 | Action\|Adventure\|Thriller | ... | 994.0 | English | UK | PG-13 | 245000000.0 | 2015.0 | 393.0 | 6.8 | 2.35 | 85000 |
| 3 | Color | Christopher Nolan | 813.0 | 164.0 | 22000.0 | 23000.0 | Christian Bale | 27000.0 | 448130642.0 | Action\|Thriller | ... | 2701.0 | English | USA | PG-13 | 250000000.0 | 2012.0 | 23000.0 | 8.5 | 2.35 | 164000 |
| 4 | NaN | Doug Walker | NaN | NaN | 131.0 | NaN | Rob Walker | 131.0 | NaN | Documentary | ... | NaN | NaN | NaN | NaN | NaN | NaN | 12.0 | 7.1 | NaN | 0 |
5 rows × 28 columns
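The `genres` column packs several genres into one pipe-separated string, so picking out comedy, action, or thriller films needs a split first. The sketch below uses `Series.str.get_dummies` on hypothetical toy rows. The report does not say whether it targets films carrying all three tags or any one of them; the all-three case is shown here as an assumption.

```python
import pandas as pd

# Toy rows mimicking the pipe-separated `genres` column (illustrative only).
toy = pd.DataFrame(
    {
        "movie_title": ["A", "B", "C"],
        "genres": ["Action|Thriller", "Comedy", "Action|Comedy|Thriller"],
    }
)

# One 0/1 indicator column per genre.
genre_flags = toy["genres"].str.get_dummies(sep="|")

# Films tagged with all three genres of interest.
wanted = ["Action", "Comedy", "Thriller"]
all_three = toy[genre_flags[wanted].eq(1).all(axis=1)]
```

Replacing `.all(axis=1)` with `.any(axis=1)` would instead keep films that carry at least one of the three tags.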
# Checking the number of rows and columns in the dataset
data.shape
(5043, 28)
# Checking which variables are categorical and which are numeric
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5043 entries, 0 to 5042
Data columns (total 28 columns):
 #   Column                     Non-Null Count  Dtype
---  ------                     --------------  -----
 0   color                      5024 non-null   object
 1   director_name              4939 non-null   object
 2   num_critic_for_reviews     4993 non-null   float64
 3   duration                   5028 non-null   float64
 4   director_facebook_likes    4939 non-null   float64
 5   actor_3_facebook_likes     5020 non-null   float64
 6   actor_2_name               5030 non-null   object
 7   actor_1_facebook_likes     5036 non-null   float64
 8   gross                      4159 non-null   float64
 9   genres                     5043 non-null   object
 10  actor_1_name               5036 non-null   object
 11  movie_title                5043 non-null   object
 12  num_voted_users            5043 non-null   int64
 13  cast_total_facebook_likes  5043 non-null   int64
 14  actor_3_name               5020 non-null   object
 15  facenumber_in_poster       5030 non-null   float64
 16  plot_keywords              4890 non-null   object
 17  movie_imdb_link            5043 non-null   object
 18  num_user_for_reviews       5022 non-null   float64
 19  language                   5031 non-null   object
 20  country                    5038 non-null   object
 21  content_rating             4740 non-null   object
 22  budget                     4551 non-null   float64
 23  title_year                 4935 non-null   float64
 24  actor_2_facebook_likes     5030 non-null   float64
 25  imdb_score                 5043 non-null   float64
 26  aspect_ratio               4714 non-null   float64
 27  movie_facebook_likes       5043 non-null   int64
dtypes: float64(13), int64(3), object(12)
memory usage: 1.1+ MB
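The non-null counts above show that several columns have gaps (for example, `gross` has 4159 non-null entries and `budget` has 4551, out of 5043 rows). A minimal sketch of how the missing values could be tallied per column, using a hypothetical toy frame in place of the real data:

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only; column names mirror the dataset.
toy = pd.DataFrame(
    {
        "gross": [1.0, np.nan, 3.0, np.nan],
        "budget": [1.0, 2.0, np.nan, 4.0],
        "imdb_score": [7.9, 7.1, 6.8, 8.5],
    }
)

# Count of missing values per column, largest first.
missing = toy.isna().sum().sort_values(ascending=False)

# The same counts as a percentage of all rows.
missing_pct = (missing / len(toy) * 100).round(1)
```

On the real data the same two lines would be applied to `data` instead of `toy`.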
# Creating a DataFrame that contains only the numeric variables
data_num = data.select_dtypes(include='number')
data_num.columns.shape
(16,)
Our dataset has 16 numeric variables and 12 categorical (object-typed) variables.
# Computing pairwise correlations to establish the strength of the relationships among the numeric variables
data_num = data_num.corr()
data_num
|  | num_critic_for_reviews | duration | director_facebook_likes | actor_3_facebook_likes | actor_1_facebook_likes | gross | num_voted_users | cast_total_facebook_likes | facenumber_in_poster | num_user_for_reviews | budget | title_year | actor_2_facebook_likes | imdb_score | aspect_ratio | movie_facebook_likes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| num_critic_for_reviews | 1.000000 | 0.258486 | 0.180674 | 0.271646 | 0.190016 | 0.480601 | 0.624943 | 0.263203 | -0.033897 | 0.609387 | 0.119994 | 0.275707 | 0.282306 | 0.305303 | -0.049786 | 0.683176 |
| duration | 0.258486 | 1.000000 | 0.173296 | 0.123558 | 0.088449 | 0.250298 | 0.314765 | 0.123074 | 0.013469 | 0.328403 | 0.074276 | -0.135038 | 0.131673 | 0.261662 | -0.090071 | 0.196605 |
| director_facebook_likes | 0.180674 | 0.173296 | 1.000000 | 0.120199 | 0.090723 | 0.144945 | 0.297057 | 0.119549 | -0.041268 | 0.221890 | 0.021090 | -0.063820 | 0.119601 | 0.170802 | 0.001642 | 0.162048 |
| actor_3_facebook_likes | 0.271646 | 0.123558 | 0.120199 | 1.000000 | 0.249927 | 0.308026 | 0.287239 | 0.473920 | 0.099368 | 0.230189 | 0.047451 | 0.096137 | 0.559662 | 0.052633 | -0.003366 | 0.278844 |
| actor_1_facebook_likes | 0.190016 | 0.088449 | 0.090723 | 0.249927 | 1.000000 | 0.154468 | 0.192804 | 0.951661 | 0.072257 | 0.145461 | 0.022639 | 0.086873 | 0.390487 | 0.076099 | -0.020049 | 0.135348 |
| gross | 0.480601 | 0.250298 | 0.144945 | 0.308026 | 0.154468 | 1.000000 | 0.637271 | 0.247400 | -0.027755 | 0.559958 | 0.102179 | 0.030886 | 0.262768 | 0.198021 | 0.069346 | 0.378082 |
| num_voted_users | 0.624943 | 0.314765 | 0.297057 | 0.287239 | 0.192804 | 0.637271 | 1.000000 | 0.265911 | -0.026998 | 0.798406 | 0.079621 | 0.007397 | 0.270790 | 0.410965 | -0.014761 | 0.537924 |
| cast_total_facebook_likes | 0.263203 | 0.123074 | 0.119549 | 0.473920 | 0.951661 | 0.247400 | 0.265911 | 1.000000 | 0.091475 | 0.206923 | 0.036557 | 0.109971 | 0.628404 | 0.085787 | -0.017885 | 0.209786 |
| facenumber_in_poster | -0.033897 | 0.013469 | -0.041268 | 0.099368 | 0.072257 | -0.027755 | -0.026998 | 0.091475 | 1.000000 | -0.069018 | -0.019559 | 0.061504 | 0.071228 | -0.062958 | 0.013713 | 0.008918 |
| num_user_for_reviews | 0.609387 | 0.328403 | 0.221890 | 0.230189 | 0.145461 | 0.559958 | 0.798406 | 0.206923 | -0.069018 | 1.000000 | 0.084292 | -0.003147 | 0.219496 | 0.292475 | -0.024719 | 0.400594 |
| budget | 0.119994 | 0.074276 | 0.021090 | 0.047451 | 0.022639 | 0.102179 | 0.079621 | 0.036557 | -0.019559 | 0.084292 | 1.000000 | 0.045726 | 0.044236 | 0.030688 | 0.006598 | 0.062039 |
| title_year | 0.275707 | -0.135038 | -0.063820 | 0.096137 | 0.086873 | 0.030886 | 0.007397 | 0.109971 | 0.061504 | -0.003147 | 0.045726 | 1.000000 | 0.101890 | -0.209167 | 0.159973 | 0.218678 |
| actor_2_facebook_likes | 0.282306 | 0.131673 | 0.119601 | 0.559662 | 0.390487 | 0.262768 | 0.270790 | 0.628404 | 0.071228 | 0.219496 | 0.044236 | 0.101890 | 1.000000 | 0.083808 | -0.007783 | 0.243487 |
| imdb_score | 0.305303 | 0.261662 | 0.170802 | 0.052633 | 0.076099 | 0.198021 | 0.410965 | 0.085787 | -0.062958 | 0.292475 | 0.030688 | -0.209167 | 0.083808 | 1.000000 | 0.059445 | 0.247049 |
| aspect_ratio | -0.049786 | -0.090071 | 0.001642 | -0.003366 | -0.020049 | 0.069346 | -0.014761 | -0.017885 | 0.013713 | -0.024719 | 0.006598 | 0.159973 | -0.007783 | 0.059445 | 1.000000 | 0.025737 |
| movie_facebook_likes | 0.683176 | 0.196605 | 0.162048 | 0.278844 | 0.135348 | 0.378082 | 0.537924 | 0.209786 | 0.008918 | 0.400594 | 0.062039 | 0.218678 | 0.243487 | 0.247049 | 0.025737 | 1.000000 |
sns.heatmap(data_num)
<AxesSubplot:>
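A 16 × 16 matrix is hard to read at a glance, so it can help to rank the variables by how strongly they correlate with a single target. The sketch below does this for `gross` on hypothetical toy data; choosing `gross` as the target is an assumption made for illustration, not something stated in the report.

```python
import pandas as pd

# Toy data for illustration only; column names mirror the dataset.
toy = pd.DataFrame(
    {
        "gross": [1.0, 2.0, 3.0, 4.0, 5.0],
        "num_voted_users": [2.0, 4.0, 6.0, 8.0, 10.0],      # rises with gross
        "facenumber_in_poster": [5.0, 4.0, 3.0, 2.0, 1.0],  # falls as gross rises
        "aspect_ratio": [1.0, 3.0, 2.0, 3.0, 1.0],          # unrelated to gross
    }
)

corr = toy.corr()

# Correlation of every other variable with `gross`, ordered by absolute
# strength while keeping the sign visible.
ranked = corr["gross"].drop("gross").sort_values(key=abs, ascending=False)
```

Sorting on the absolute value keeps strong negative relationships near the top instead of burying them at the bottom of the list.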
# Using the profile report method to get summary statistics for the dataset
data.profile_report()
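When the profiling package is not available, a rough substitute for the summary statistics can be had from pandas alone. A minimal sketch on a hypothetical toy frame, using `DataFrame.describe` separately for numeric and non-numeric columns:

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only; column names mirror the dataset.
toy = pd.DataFrame(
    {
        "budget": [100.0, 200.0, np.nan, 400.0],
        "color": ["Color", "Color", None, "Black and White"],
    }
)

# Numeric columns: count, mean, std, min, quartiles, max.
numeric_summary = toy.describe()

# Non-numeric columns: count, unique, top (most frequent value), freq.
categorical_summary = toy.describe(exclude="number")
```

This gives far less detail than a full profile report (no histograms, correlations, or warnings), but it needs no extra dependency.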